NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Si, Chenglei; Goyal, Navita; Wu, Sherry Tongshuang; Zhao, Chen; Feng, Shi; Daumé-III, Hal; Boyd-Graber, Jordan (June 2023, 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024))

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. Our experiments with 80 crowdworkers compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.
more » « less
Full Text Available
Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval

https://doi.org/10.18653/v1/2021.naacl-main.368

Zhao, Chen; Xiong, Chenyan; Boyd-Graber, Jordan; Daumé III, Hal (January 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)

Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming text corpora is semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BEAMDR) that iteratively forms an evidence chain through beam search in dense representations. When evaluated on multi-hop question answering, BEAMDR is competitive to state-of-the-art systems, without using any semi-structured information. Through query composition in dense space, BEAMDR captures the implicit relationships between evidence in the reasoning chain. The code is available at https://github.com/ henryzhao5852/BeamDR.
more » « less
Full Text Available
On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries

https://doi.org/10.18653/v1/2020.findings-emnlp.167

Shi, Tianze; Zhao, Chen; Boyd-Graber, Jordan; Daumé III, Hal; Lee, Lillian (January 2020, Findings of the Association for Computational Linguistics: EMNLP 2020)
null (Ed.)
Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce SQUALL, a dataset that enriches 11,276 WIKITABLEQUESTIONS English-language questions with manually created SQL equivalents plus alignments between SQL and question fragments. Our annotation enables new training possibilities for encoderdecoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention; (2) adopting an auxiliary objective of disambiguating references in the input queries to table columns. In 5-fold cross validation, these strategies improve over strong baselines by 4.4% execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to 23.9%.
more » « less
Full Text Available
Prior and Prejudice: The Novice Reviewers' Bias against Resubmissions in Conference Peer Review

https://doi.org/10.1145/3449149

Stelmakh, Ivan; Shah, Nihar_B; Singh, Aarti; Daumé, III, Hal (April 2021, Proceedings of the ACM on Human-Computer Interaction)

Modern machine learning and computer science conferences are experiencing a surge in the number of submissions that challenges the quality of peer review as the number of competent reviewers is growing at a much slower rate. To curb this trend and reduce the burden on reviewers, several conferences have started encouraging or even requiring authors to declare the previous submission history of their papers. Such initiatives have been met with skepticism among authors, who raise the concern about a potential bias in reviewers' recommendations induced by this information. In this work, we investigate whether reviewers exhibit a bias caused by the knowledge that the submission under review was previously rejected at a similar venue, focusing on a population of novice reviewers who constitute a large fraction of the reviewer pool in leading machine learning and computer science conferences. We design and conduct a randomized controlled trial closely replicating the relevant components of the peer-review pipeline with $133$ reviewers (master's, junior PhD students, and recent graduates of top US universities) writing reviews for $19$ papers. The analysis reveals that reviewers indeed become negatively biased when they receive a signal about paper being a resubmission, giving almost 1 point lower overall score on a 10-point Likert item (Δ = -0.78, 95% CI = [-1.30, -0.24]) than reviewers who do not receive such a signal. Looking at specific criteria scores (originality, quality, clarity and significance), we observe that novice reviewers tend to underrate quality the most.
more » « less

Search for: All records